Discussion

Goal: The goal of clustering is to split the observations into groups such that the observations within each group are similar and the groups are different from one another.

Hierarchical Clustering

In this type of clustering, we start with each observation as its own cluster (a leaf) and slowly combine them until we get to one cluster (a trunk). In the end, a dendrogram is used to help decide a good number of clusters to use.

Here is one example. Image credit: http://varianceexplained.org/r/love-actually-network/

Moving up the tree, similar leaves fuse into branches. The more similar two leaves are, the sooner their branches fuse. The height at which two cases’ branches first fuse measures the “distance” between them.

As illustrated by Allison Horst

Allison Horst is the Artist in Residence at RStudio - how cool is that?! The following artwork was done by her and can be found on her GitHub page.

And now … my boring notes

Let’s look at an example. Recall the sample of iris data.

library(dplyr)
library(ggplot2)

set.seed(15)
iris_sub <- iris %>% 
  sample_n(20) %>% 
  mutate(obs = 1:20)

iris_sub %>% 
  ggplot(aes(x=Sepal.Length, y=Petal.Length)) +
  geom_point() +
  geom_text(aes(label=obs),
            nudge_x = .06)

Let’s first look at the results, and we’ll talk about the details of the algorithm after.

iris_dist <- dist(iris_sub %>% 
                    select(Petal.Length,Sepal.Length))

#distance matrix - remove # below and run to see it
#iris_dist 

plot(hclust(iris_dist, "single"), 
     xlab = "")
abline(h = 1, col="darkred")
abline(h = 0, col="darkred")

QUESTIONS:

  1. Notice that observations 3 & 4 and 6 & 9 are fused at a height of just over zero. How are those two sets of observations similar in the scatterplot?
  2. Observations 12 and 5 are the last to be fused to another observation. How are those observations “different” in the scatterplot?
  3. The red lines I drew on the dendrogram represent different “stopping” places: if we chose to stop the clustering there, the final clusters would be the branches the line crosses, plus any single observations whose first fusion lies above the line.

The algorithm:

  1. Start with all \(n\) observations as their own cluster.
  2. Compute the Euclidean distance between all \(\binom{n}{2} = n(n-1)/2\) pairs of observations (use the dist() function in R).
  3. Fuse the observations that are closest. The distance between them is represented as the height they are fused on the dendrogram.
  4. Compute the distance between the remaining clusters. When a cluster contains more than one observation, this can be done in a variety of ways; two linkage methods are discussed below.
  • Complete/maximal linkage: the distance between clusters with more than one observation is the maximal distance. That is, all pairwise distances are computed and the largest one is used.
  • Single/minimal linkage: the distance between clusters with more than one observation is the minimal distance. That is, all pairwise distances are computed and the smallest one is used.
  5. Repeat steps 3 and 4 until there is only one cluster, the trunk.
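The fusing loop above can be sketched directly in base R. This is a toy single-linkage implementation for illustration only (the names `single_linkage_demo` and `pts` are mine); it is not the code that hclust() actually runs, but its fusion heights should match hclust's:

```r
# Toy single-linkage agglomeration. Each row of `pts` is an observation;
# clusters are stored as vectors of row indices.
single_linkage_demo <- function(pts) {
  d <- as.matrix(dist(pts))              # step 2: all pairwise Euclidean distances
  clusters <- as.list(seq_len(nrow(pts)))
  heights <- c()
  while (length(clusters) > 1) {
    # steps 3-4: find the pair of clusters with the smallest single-linkage distance
    best <- c(1, 2); best_d <- Inf
    for (i in seq_along(clusters)) for (j in seq_along(clusters)) if (i < j) {
      dij <- min(d[clusters[[i]], clusters[[j]], drop = FALSE])  # single linkage: minimum pairwise distance
      if (dij < best_d) { best_d <- dij; best <- c(i, j) }
    }
    heights <- c(heights, best_d)        # fusion height on the dendrogram
    clusters[[best[1]]] <- c(clusters[[best[1]]], clusters[[best[2]]])
    clusters <- clusters[-best[2]]       # the two clusters become one
  }
  heights
}

set.seed(1)
pts <- matrix(rnorm(10), ncol = 2)
# The sorted fusion heights agree with hclust's single-linkage heights
all.equal(sort(single_linkage_demo(pts)), sort(hclust(dist(pts), "single")$height))
```

Swapping min() for max() inside the loop turns this into complete linkage.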

R functions

This is easy to do in R using the hclust() function. We need to give this function a “distance object”: the lower triangle of the matrix of Euclidean distances between all pairs of observations. The dist() function creates the distance object; we just give it our dataset with the pertinent variables. Notice I kept only the variables we are interested in using for clustering. If I kept more, it would change the distance object.
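As a quick sanity check that dist() computes plain Euclidean distance, here is a hand computation on a two-row toy data frame (the object names `d_pkg` and `d_hand` are mine):

```r
# Two observations measured on two variables
x <- data.frame(Petal.Length = c(1.4, 4.7), Sepal.Length = c(5.1, 7.0))

# dist() result for the single pair
d_pkg <- as.numeric(dist(x))

# Euclidean distance by hand: square root of the summed squared differences
d_hand <- sqrt((1.4 - 4.7)^2 + (5.1 - 7.0)^2)

all.equal(d_pkg, d_hand)  # the two agree
```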

iris_dist <- dist(iris_sub %>% 
                    select(Petal.Length,Sepal.Length))

#distance matrix - remove # below and run to see it
#iris_dist 

iris_hclust <- hclust(iris_dist, method = "single")
iris_hclust
## 
## Call:
## hclust(d = iris_dist, method = "single")
## 
## Cluster method   : single 
## Distance         : euclidean 
## Number of objects: 20

Nothing terribly interesting comes from the regular output. The most interesting piece is the dendrogram.

plot(iris_hclust, main = "Hierarchical Clustering based on Petal Length and Sepal Length",
     xlab = "")
abline(h = 1, col="darkred")
abline(h = 0, col="darkred")

We can also obtain the cluster labels for each observation at a specific cut of the dendrogram.

cutree(iris_hclust, h = 0)
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
cutree(iris_hclust, h = 1)
##  [1] 1 2 1 1 3 2 2 2 2 1 2 4 2 1 2 2 1 1 2 2
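cutree() can also cut the tree at a desired number of clusters with the k argument instead of a height h. This sketch (variable name `clusters_k3` is mine) attaches the labels back onto the data so we can color the original scatterplot by cluster:

```r
library(dplyr)
library(ggplot2)

# Recreate the sample and the clustering from above
set.seed(15)
iris_sub <- iris %>% sample_n(20)
iris_hclust <- hclust(dist(iris_sub %>% select(Petal.Length, Sepal.Length)),
                      method = "single")

# Ask for exactly 3 clusters instead of cutting at a height
clusters_k3 <- cutree(iris_hclust, k = 3)

# Color the scatterplot by cluster label
iris_sub %>% 
  mutate(cluster = factor(clusters_k3)) %>% 
  ggplot(aes(x = Sepal.Length, y = Petal.Length, color = cluster)) +
  geom_point()
```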

Exercises

  1. Look at the output (rounded) from the dist() function, which computes the Euclidean distances. Use it to create the first 5 fuses of the dendrogram “by hand”, using the single linkage method.
round(dist(iris_sub %>% select(Petal.Length, Sepal.Length)),2)
##       1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
## 2  5.70                                                                      
## 3  0.61 5.86                                                                 
## 4  0.51 5.81 0.10                                                            
## 5  3.26 3.42 3.10 3.10                                                       
## 6  3.83 2.19 3.86 3.83 1.25                                                  
## 7  3.65 2.27 3.70 3.67 1.26 0.22                                             
## 8  4.33 1.39 4.47 4.43 2.19 0.95 0.94                                        
## 9  3.82 2.27 3.83 3.81 1.17 0.10 0.28 1.04                                   
## 10 0.92 5.47 0.51 0.54 2.60 3.42 3.27 4.08 3.38                              
## 11 4.30 1.63 4.37 4.33 1.80 0.57 0.67 0.51 0.64 3.94                         
## 12 1.75 4.38 1.61 1.60 1.51 2.28 2.15 3.00 2.25 1.14 2.82                    
## 13 4.81 0.89 4.96 4.92 2.64 1.39 1.42 0.50 1.48 4.58 0.85 3.50               
## 14 0.45 5.26 0.85 0.76 2.91 3.41 3.22 3.89 3.41 0.92 3.86 1.43 4.37          
## 15 2.77 3.00 2.86 2.82 1.30 1.10 0.90 1.61 1.12 2.47 1.53 1.41 2.11 2.33     
## 16 4.08 1.66 4.20 4.16 1.93 0.71 0.67 0.28 0.81 3.81 0.42 2.72 0.78 3.64 1.34
## 17 0.85 5.78 0.28 0.36 2.91 3.73 3.58 4.39 3.70 0.32 4.25 1.46 4.89 1.00 2.78
## 18 0.41 5.37 0.58 0.50 2.84 3.45 3.28 3.99 3.44 0.63 3.93 1.33 4.48 0.30 2.40
## 19 3.48 2.24 3.62 3.58 1.70 0.78 0.58 0.85 0.86 3.24 0.92 2.19 1.34 3.04 0.78
## 20 3.24 2.64 3.29 3.26 1.10 0.60 0.41 1.27 0.61 2.86 1.08 1.75 1.77 2.82 0.51
##      16   17   18   19
## 2                     
## 3                     
## 4                     
## 5                     
## 6                     
## 7                     
## 8                     
## 9                     
## 10                    
## 11                    
## 12                    
## 13                    
## 14                    
## 15                    
## 16                    
## 17 4.12               
## 18 3.73 0.71          
## 19 0.61 3.55 3.14     
## 20 0.99 3.18 2.86 0.61
  2. Use the entire iris dataset, except the Species variable. Perform hierarchical clustering using both single linkage and complete linkage. What differences do you see? How many clusters would you choose? Does your “favorite” number of clusters change by method?

  3. Use the candy_rankings dataset from the fivethirtyeight library and cluster candies together using hierarchical clustering. Try both single and complete linkage. How many clusters would you choose? How do your results compare to the K-means clusters?

  4. Should I consider scaling my variables so that they are centered at 0 with a standard deviation of 1? Why?

  5. How might this method be challenging with a large dataset?